Support vector machines power classification algorithms that can form intricate boundaries for delineating between multiple classes. One of the main advantages of support vector classifiers (SVCs) is that fitting occurs with an implicit transformation of the existing features into a higher-dimensional space. Thus, non-linear combinations of features can be explored without needing to create them explicitly. This report describes the effects of SVC parameters on model fitting, and demonstrates an iterative scanning method for determining the optimal values of these parameters, using classification of Iris flower species as an example case.
Python code for this iterative optimization can be found on this GitHub repository, which includes an IPython notebook demonstrating the process. For best results with the interactive plots, please view this page on a computer rather than a mobile device.
One of the main strengths of SVCs lies in the use of kernels: functions that transform the input data into a new basis for classification. Two popular kernel choices are linear and Gaussian (the latter is also referred to as "radial basis function", or RBF, in SVC contexts). As demonstrated with the simulated data below, Gaussian kernels provide flexibility in the shape of the separating surface (also known as the "decision boundary"), while linear kernels only allow separation along linear boundaries.
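As a minimal sketch of this contrast, the snippet below uses scikit-learn to fit linear- and Gaussian-kerneled SVCs to simulated data in which one class encircles the other; the dataset and parameter choices are illustrative, not taken from the report's own experiments. Only the RBF kernel can close a boundary around the inner class:

```python
# Compare linear and RBF kernels on data where one class encircles the other.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings of points; no straight line can separate them.
X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

linear_clf = SVC(kernel="linear").fit(X, y)
rbf_clf = SVC(kernel="rbf").fit(X, y)

# The linear kernel cannot draw a closed decision boundary, so its
# training accuracy stays near chance; the Gaussian kernel can.
print("linear accuracy:", linear_clf.score(X, y))
print("RBF accuracy:   ", rbf_clf.score(X, y))
```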
Non-linear decision boundaries can also be attained with simpler classification algorithms, such as logistic regression, but only by creating polynomial combinations of the existing features. This, in turn, requires optimizing the degree of the applied polynomial, and consumes additional memory and computation as the new features must be calculated, scaled, and stored. SVCs with Gaussian kernels circumvent these issues, allowing classification along non-linear decision boundaries with less effort and fewer sources of error.
As with many other learning algorithms, SVCs use a constant input parameter to fine-tune the sensitivity of the fitting process. This parameter, C, introduces a penalty term in the cost function that restricts the magnitude of the weight vector applied to the kernel-transformed data. In practical terms, this controls how sensitive the fitting is to individual data points (and therefore to outliers); in other words, the value of C controls the complexity of the decision boundary. The magnitude of the penalty term is inversely proportional to C, so higher values of C lead to more complex models and lower values lead to simpler, smoother models.
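The effect of C on fitting sensitivity can be sketched as follows; the noisy two-moons dataset and the specific C values are illustrative choices, not the report's own data. With a fixed kernel width, raising C makes the boundary bend around individual noisy points, which shows up as rising training accuracy:

```python
# Vary C on noisy data: larger C penalizes misclassified training points
# more heavily, producing a more complex boundary and higher training accuracy.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

scores = {}
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="rbf", C=C, gamma=1.0).fit(X, y)
    scores[C] = clf.score(X, y)  # accuracy on the training data itself
    print(f"C={C:>6}: training accuracy = {scores[C]:.3f}")
```

Note that high training accuracy at large C may simply indicate overfitting, which is exactly the failure mode the train/test split discussed later is meant to expose.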
Additionally, some kernels allow further control of the fitting through kernel-associated parameters. In the case of Gaussian-kerneled SVCs, the width of the individual kernels is controlled by a constant called gamma. Gaussian functions are perhaps best known as the form of normal distributions, with a width proportional to the standard deviation (sigma). Gamma is related to sigma as follows:

gamma = 1 / (2 * sigma^2)

Thus, a higher value of gamma equates to a smaller value of sigma, and a narrower distribution. For a fixed value of C, a higher value of gamma produces a sharper, more complex decision boundary.
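The gamma/sigma relationship can be checked numerically. In the sketch below (the evaluation points and gamma values are arbitrary illustrations), the RBF kernel value between two points a fixed distance apart shrinks as gamma grows, i.e. as the implied sigma narrows:

```python
# The RBF kernel is K(x, z) = exp(-gamma * ||x - z||^2).
# Evaluate it at a fixed distance for several gammas, and recover the
# equivalent sigma from gamma = 1 / (2 * sigma^2).
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[0.0]])
z = np.array([[1.0]])  # point at distance 1 from x

vals = {}
for gamma in (0.1, 1.0, 10.0):
    vals[gamma] = rbf_kernel(x, z, gamma=gamma)[0, 0]
    sigma = 1.0 / np.sqrt(2.0 * gamma)
    print(f"gamma={gamma:>5}: sigma={sigma:.3f}, K(x, z)={vals[gamma]:.5f}")
```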
In the figure shown above, the leftmost column shows models that fit the data inaccurately, either by failing to recover the closed, enveloping boundary (A and D) or by enclosing too wide a central region (G). In either case, the model would have low accuracy in assigning these examples to their correct classes. This represents "underfitting": the SVC was too constrained to accurately fit the input data. Models C, F, I, and (to a lesser extent) H represent the converse scenario: the SVC was not sufficiently constrained, and produced a model that fits the input data well (or perfectly) but would likely perform poorly on new data because the decision boundary shape is too specific. This phenomenon is known as "overfitting". Models B and E represent reasonable fits to the data, and therefore suitable choices for C and gamma.
In practice, values for C and gamma should be selected with a more robust and quantitative approach than visual assessment of the fitting results; moreover, visualization is impractical for higher-dimensional data. A sounder approach is to try a range of parameter values and assess the accuracy of each resulting fit. One problem with this approach is that overfitted models will classify the data on which they were trained with high accuracy. To circumvent this issue, the dataset can be randomly split into a set for training the model and a set for testing it. With this approach, both overfitted and underfitted models yield low test accuracies and can therefore be distinguished from models that more closely represent the underlying distribution. This "train/test split" method can also be thought of as a crude simulation of how the model would handle new data in subsequent applications.
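The scan described above can be sketched as a simple grid search over C and gamma, scoring each candidate on a held-out test set; the simulated dataset and the logarithmic grid are illustrative assumptions, not the report's actual configuration:

```python
# Scan a grid of (C, gamma) values, scoring each fit on held-out data so
# that both underfitted and overfitted models are penalized.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

best = (None, None, 0.0)  # (C, gamma, test accuracy)
for C in 10.0 ** np.arange(-2, 3):
    for gamma in 10.0 ** np.arange(-2, 3):
        clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_train, y_train)
        acc = clf.score(X_test, y_test)
        if acc > best[2]:
            best = (C, gamma, acc)

print("best C, gamma, test accuracy:", best)
```

In a real application the same idea is usually strengthened with cross-validation rather than a single split, so that the chosen parameters do not depend on one random partition.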
The iterative scan parameter optimization method (described in the following section) will be applied to an example case of classification among three species of Iris flower: Iris setosa, Iris versicolor, and Iris virginica. As demonstrated in the photographs below, the flowers of the three species have highly similar outward appearances at first glance.
The classification task will be to identify the species of a flower given measurements of the widths and lengths of its petals and sepals. SVCs will be trained and tested on the famous Iris dataset (shown below), a commonly used test case for classification algorithms that contains 50 examples of each of the three species mentioned above. The original measurements are in centimeters, but each feature has been independently converted to standard scores so that all features have similar scaling and are centered around zero, which improves SVC fitting performance.
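The loading and standardization step can be sketched as follows; scikit-learn ships the same 150-example Iris dataset, and `StandardScaler` performs the per-feature conversion to standard scores described above:

```python
# Load the Iris dataset and convert each feature to standard scores
# (zero mean, unit variance), independently per feature.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)  # 150 examples x 4 features
y = iris.target  # 0 = setosa, 1 = versicolor, 2 = virginica

print("feature means:", X.mean(axis=0).round(6))
print("feature stds: ", X.std(axis=0).round(6))
```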
From the plots above, it is apparent that distinguishing setosa should pose no real problem, as it is linearly separable in all variable combinations (and can even be isolated based on either petal variable alone). Conversely, versicolor and virginica are more intertwined and cannot be fully separated by linear boundaries. However, by leveraging all four dimensions simultaneously, and with optimal parameters, SVCs will be able to assign the species with a maximal accuracy of 97.5%.
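A minimal end-to-end sketch of this task is shown below. The particular C and gamma values, the split proportion, and the random seed are illustrative assumptions (and for simplicity the scaler is fit on the full dataset, mirroring the pre-standardized data described above), so the resulting accuracy need not match the report's 97.5% figure:

```python
# Fit a Gaussian-kerneled SVC on all four standardized Iris features and
# evaluate it on a held-out test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)
X_train, X_test, y_train, y_test = train_test_split(
    X, iris.target, test_size=0.3, random_state=0, stratify=iris.target)

# Illustrative parameter values; in practice C and gamma would come from
# the iterative scan described in the text.
clf = SVC(kernel="rbf", C=1.0, gamma=0.1).fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.3f}")
```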